[57] ABSTRACT

A computing system in the form of a cluster of a number of 
multiprocessing nodes maintains, in a fault tolerant manner, 
a distribution of configuration data for each of the nodes so 
that each node has a database containing the configuration 
data associated with that node. The database, and therefore, 
the configuration data it contains, associated with any one 
node is substantially identical to that of any other node. A 
process running on one of the nodes is responsible for
receiving requests that require modification of the
configuration data. Effecting changes to the configuration data,
and therefore the distributed databases, includes the steps of 
first writing the requested change to a master audit log, 
distributing the change request to all nodes, receiving back 
from the nodes acknowledgement of the change request 
being effected at the acknowledging node, and then writing 
again to the master audit log that the change has been 
effected throughout the system. The master audit log thereby 
contains a reliable copy of the configuration data maintained 
in the database associated with each node of the cluster so 
that in the event any of the configuration data becomes 
corrupted, it can be replaced with correct data from the 
master audit log. 
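
The sequence summarized in the abstract (log the request, distribute it, collect acknowledgements, log completion) can be sketched informally as follows. This is a minimal illustration only, not the patented implementation: the class names `Monitor` and `PrimaryProcess`, the in-memory list standing in for the master audit log, and the record layout are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    """Hypothetical per-node monitor process that owns the node's registry."""
    node_id: int
    registry: dict = field(default_factory=dict)

    def apply(self, key: str, value: str) -> bool:
        # Make the requested change and acknowledge it.
        self.registry[key] = value
        return True


@dataclass
class PrimaryProcess:
    """Hypothetical primary process; audit_log stands in for the master audit log."""
    monitors: list
    audit_log: list = field(default_factory=list)
    next_id: int = 1

    def change(self, key: str, value: str) -> bool:
        change_id = self.next_id
        self.next_id += 1
        # Step 1: record the requested change before touching any registry.
        self.audit_log.append(("REQUESTED", change_id, key, value))
        # Step 2: distribute the change to every node, including our own.
        acks = [m.apply(key, value) for m in self.monitors]
        # Step 3: only when every node acknowledges is completion logged.
        if all(acks):
            self.audit_log.append(("COMPLETE", change_id, key, value))
            return True
        return False


monitors = [Monitor(n) for n in range(4)]
primary = PrimaryProcess(monitors)
primary.change("user.alice.node", "3")
```

After the call, every monitor's registry holds the same entry and the log holds a REQUESTED record followed by a COMPLETE record for the same change identifier. Mirroring of the log and failure handling are omitted from the sketch.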

6 Claims, 3 Drawing Sheets 
FAULT TOLERANT METHOD OF MAINTAINING AND DISTRIBUTING CONFIGURATION INFORMATION IN A DISTRIBUTED PROCESSING SYSTEM

BACKGROUND OF THE INVENTION

The present invention relates to computing systems of a type in which multiple processor units are arranged as a cluster of communicatively interconnected nodes, each node comprising one or more processor units. In particular, the invention relates to maintaining and distributing to each node configuration data identifying particular characteristics of the cluster and its elements in a fault tolerant manner.

In today's industry, there are certain computing environments, such as stock exchanges, banks, telecommunications companies, and other mission critical applications, that do not tolerate well even momentary loss of computing facilities. For this reason such environments have, for many years, relied on fault tolerant and highly available computer systems. The architectures of such systems range from simple hot-standby arrangements (i.e., a back-up computer system stands ready to take over the tasks of a primary computer system should it fail) to complex architectures which employ dedicated (and replicated) portions of the computing hardware. These latter systems may be most effective in providing continuous availability, since they have been designed with the goal of surviving any single point of hardware failure, but suffer a price premium due to the increased component cost needed for component replication. But, even with component replication, the architecture is still susceptible to a single point of failure: the operating system. One approach to the problem of a single operating system is to employ a distributed operating system.

Distributed operating systems allow collections of independent machines, referred to as nodes, to be connected by a communication interconnect, forming a "cluster" which can operate as a single system or as a collection of independent processing resources. Fault tolerance can be provided by incorporating hardware fault detection with the distribution of the operating system in the cluster. High availability is achieved by distributing the system services and providing for takeover of a failed node by a backup node. With this approach the system as a whole can still function even with the loss of one or more of the nodes that make up the cluster. Therefore, the operating system will no longer be a single point of failure. Since the operating system is providing the high availability and fault tolerance, it is no longer necessary to incorporate replicated hardware components, although their use is not precluded. This can alleviate the price premium of fault tolerant hardware.

Recently, the clustering concept has been extended to computing architectures in which groups of individual processor units form the nodes of the cluster. This approach allows each node, having two or more processor units, to operate as a symmetric multiprocessing (SMP) system capable of exploiting the power of multiple processor units through distribution of the operating system and thereby balance the system load of the SMP node. In addition, it may be possible for an SMP configured node to reduce downtime because the operating system of the node can continue to run on remaining processors in the event of failure of one processor.

However, in order to employ multiple SMP nodes in a cluster, and have them able to operate efficiently as a single processing environment, there should be available configuration data that provides a description of the cluster. That description will provide, for example, information such as how many nodes make up the cluster, the composition of each node, the address of each processor unit of a node, the processes running on or available to the node(s), the users of the cluster and their preferences, and the like. Further, this configuration data should remain consistent, accurate and continuously updated across the cluster members, and herein are introduced areas of attack on the fault tolerant and high availability aspects of the cluster. Improper retention and/or distribution of the configuration data can leave it vulnerable to corruption by viruses, hackers, or even inadvertent, but [illegible], corruption by a system administrator who makes an erroneous change. In addition, the configuration data [illegible] across all nodes to allow all cluster members to agree, e.g., as to what nodes (and the [illegible]) [illegible], where changes to the configuration data used by one node should also be made to the configuration data of other nodes. Thus, distribution of such changes must be resistant to faults.

As reliance on computer systems continues to permeate our [illegible] and as more services move on-line, twenty-four by seven operation and accessibility will become critical. Therefore, fault tolerance and high availability are, and will continue to become, exceedingly important. Being able to offer the same level of fault tolerance and high availability in software via clustering, as can be achieved with fault tolerant hardware, will be very attractive. Highly available, fault tolerant, and scalable systems will then be able to be created from commodity components and still achieve the same level of reliability and performance as much more costly dedicated fault tolerant (FT) hardware.

SUMMARY OF THE INVENTION

The present invention provides a method of maintaining a consistent database of configuration data across the interconnected nodes of a cluster so that the configuration data remains highly reliable and available in the face of all but the most disastrous attacks. Changes to the configuration data, and distribution of those changes, are handled in a fault tolerant manner so that the configuration data accessible to one node is substantially identical to that of any other node.

According to the invention, each of a number of multiprocessor nodes of a computing system cluster is provided a database or "registry" for containing configuration data. One of the nodes is the residence of a primary process that [illegible] has the responsibility of receiving all requests that require a change to the configuration data in the registry of any of the nodes, and therefore the registries of all the nodes. The primary process maintains a master audit log on a disk storage that is mirrored (i.e., an identical copy of the content of the audit log is kept on a second disk storage unit). Thus, all requests for a change of the configuration data maintained in the registry received by the cluster are routed to the primary process. When a request for registry change is received by the primary process, information concerning the request is first written to the master audit log and mirror audit log. Then, the primary process prepares a message containing request data and sends the message to a monitor process running on each node (including that node at which the primary process resides). Each monitor process is responsible for maintaining and providing access to the registry. Upon receipt of the request data, each monitor process will access the associated registry of that node to make the indicated change and report back to the primary






6,092,213 


process that the change of the request was accomplished. When the primary process receives change reports from all nodes of the cluster, it will write to the master audit log (and mirror audit log) that the requested change is complete. Thereby, all changes to the registry are maintained in the audit log so that a complete copy of the registry is kept by the audit log.

In a further embodiment of the invention, nodes of the cluster may be configured to implement the "process pair" technique described in U.S. Pat. No. 4,817,091. According to this embodiment of the invention, a process of any node can have a backup process on another node somewhere else in the cluster. Failure of the process will result in its backup process taking over the tasks of the failed process. To ensure that the backup process is able to pick up from as near to the point of failure as possible, "checkpoint" data is sent by a process to its backup so that the backup is kept up-to-date as to the activity of the process it is backing up. These checkpoints are made for significant events (those events that are important to a takeover), or more often if desired.

This process pair technique is employed in connection with the primary process. Thus, when the primary process receives a request to change the configuration data contained in the registry, an indication of that change is "checkpointed" to a backup process for the primary process, before the indication of the request is written to the master audit trail. Should the primary process fail before the change to the configuration data of the registry is complete, the backup process can either complete the change, depending upon where in the operation the failure occurred, or back out of the operation in favor of beginning over again.

It will be apparent to those skilled in this art that the present invention has a number of advantages. Configuration data may now be maintained safe from corruption, and distributed in a fault tolerant, reliable manner. Corruption of any registry of any one node (or even all nodes) can be corrected using the content of the master audit log. In addition, the master audit log is kept updated in a manner that ensures the configuration data's credibility and reliability.

These and other features, aspects, and advantages will become apparent upon a reading of the detailed description of the invention, which should be taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a simplified block diagram illustration of computing system architecture in the form of a four node cluster;

FIG. 2 is a conceptual illustration of the cluster of FIG. 1 to show the steps taken by the present invention; and

FIG. 3 is a flow diagram illustration, showing the steps taken by the present invention to maintain a safe and reliable copy of the registry, and to distribute changes to the registry to the nodes of the cluster of FIG. 1 in a fault tolerant manner.

DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring now to the Figures, and for the moment specifically to FIG. 1, there is illustrated a computing system, generally designated with the reference numeral 10, comprising a number (four) of processing nodes 12 (node 0, node 1, . . . , node 3) that form a cluster 14. Each processing node 12, in turn, may comprise one or more processor units P. Thus, as FIG. 1 illustrates, node 0, node 1, and node 3 comprise two or more processor units P. Node 2, on the other hand, includes only one. This is for illustrative purposes only, and how many processor units P are used to make up any one node 12 of a cluster is not relevant to the employment of the present invention, although a limit of 8 processor units P for each node may be sufficient. (The cluster 14, on the other hand, may include up to 128 nodes 12.)

The limitation of 8 processor units P per node 12 results from the symmetric multiprocessing operating system used for the nodes: Windows NT, which is available from Microsoft Corporation, One Microsoft Way, Redmond, Wash. (Windows and Windows NT are trademarks of Microsoft Corporation). Future editions of the Windows NT operating system, or other operating systems, may allow increased numbers of processor units to be included in each node.

The nodes are communicatively connected to one another, and to a number of storage elements 16 (16₁, 16₂, . . . , 16ₙ), here represented as disk storage, by a data communication network 15. The communication network 15 preferably is that disclosed in U.S. Pat. No. 5,574,849, the disclosure of which is incorporated by reference to the extent necessary.

In order for each processor unit P to keep information concerning its own configuration (the processes it runs, any special information concerning those processes, special needs of users of those processes), as well as information needed by each of the nodes 12, each node will maintain configuration data in a registry that is kept, for each node, on disk storage. Thus, the registries for node 0, node 1, . . . , node 3 are kept on storage units 16₃, 16₄, 16₅, and 16₆, respectively. Each node will have a monitor (MON) process (FIG. 2) whose responsibility is, among other things not relevant here, to access and maintain the registry.

Changes and modifications to the respective registries are effected by configuration messages that are sent by a central or primary process to all MON processes, as discussed further below. According to the present invention, all registries are maintained by the respective MON processes of the associated nodes so that they (the registries) are identical.

The system 10 typically can include a number of work stations, or other user (or other) input apparatus, from which transaction requests may be received and responded to as necessary. (As used here, a transaction is an explicitly delimited operation, or set of related operations.) The work station 20 is meant to represent such input apparatus.

One of the nodes 12, for example node 0, is chosen as the residence node of a primary process with the responsibility of keeping track of the transactions received by the system 10, where they are being handled, and when the transaction is complete. Transaction monitoring is used to ensure that the transaction completes, or if it does not complete (e.g., because the node/processor/process operating on the transaction fails), [illegible] will either [illegible] to complete the transaction or back out to a point where the transaction can be restarted (e.g., on another node/processor/process, as the case may be) in an effort to complete. This primary process (hereinafter "Primary Transaction Monitoring Process," or "P-TMP") is also assigned the responsibility of receiving all indications ("requests") that will require a change to the registries of the system 10. (For example, registry changes may be necessitated by addition or removal of nodes to or from the system 10, or changes in other configuration aspects.) This helps ensure that the configuration data contained in the registries associated with all nodes 12 remains consistent and substantially
identical for those reasons stated above. In addition, the P-TMP will also be responsible for keeping a copy of the configuration data in a safe repository, here a master audit log or master audit trail ("MAT") that is retained by disk storage unit 16₁. Of course, the master audit log may also keep other important information that, like the configuration data, needs to be copied so that in the event such information is corrupted, lost, or otherwise rendered suspect, it can be recreated and replaced from the copy. (A discussion of an audit log may be found in U.S. Pat. No. 5,590,275.) A mirrored copy of the MAT, MAT′, is kept on a separate storage facility: disk storage unit 16₂.

Turning now to FIG. 2, the nodes of the system 10 (FIG. 1) are shown in a conceptual form to represent the processes used for the present invention, and their interrelation to implement the invention. FIG. 3 broadly illustrates, in flow diagram form, the operations of the invention.

Referring first, however, to FIG. 2, shown, as indicated above, is the P-TMP installed and running on the SMP environment provided by the processor units P of node 0. Its backup, B-TMP, is installed on node 2, although any other node (except, preferably, node 0) may be used to accommodate B-TMP. In addition, the monitor process (MON) is installed on each of the nodes 12 and, among its various tasks, has the responsibility of accessing, maintaining, and modifying the registry (REG) for the associated node 12.

FIG. 3 diagrammatically illustrates the steps taken to modify the registry associated with each of the nodes. It will [illegible] communication between [illegible] the processor units P) and any other process in the system 10 or the [illegible] communications network 15. Thus, for example, the P-TMP [illegible] mirror, MAT′, are kept. In addition, the communications network 15 provides the medium for allowing the P-TMP to communicate with [illegible] nodes 12. Similarly, the MON processes of each node 12 communicate with their associated registry (REG) also through the communications network 15. Preferably, the communications network 15 will take the form of that shown in U.S. Pat. No. 5,574,849, although those skilled in this art will readily see that other forms of communication networks may be used.

Referring now to FIG. 3, and taking it in conjunction with FIG. 2, the steps for updating the configuration data maintained by the registries REG will now be described. Assume that the system 10 receives an indication that the configuration of that system changes. For example, a new user logs onto the system 10 from the workstation 20 (FIG. 1) to use a process installed on node 3. The presence of a new user, the process(es) that will be employed by that user, and other information are matters pertaining to the configuration of the system, and are kept in the configuration data of the system registry (i.e., the registries maintained by each node 12). The configuration data maintained by the registries must be updated to account for the new user, the workstation being used, the process(es) invoked, and any other additional information needed by the nodes 12. The particulars concerning the new user will be routed to the P-TMP as shown by the flow diagram 30 at step 32.

When the information is received by the P-TMP (step 32), it will first transmit a communication to the B-TMP (step 34) on node 2 with information indicative of the fact that a change of the configuration data has been, in effect, requested together with the required information. The reason for this is that if the P-TMP fails during the change operation, the B-TMP, using the information concerning the change, can either continue the change operation until completion, or back it up and start over, when it takes over for the failed P-TMP.

Next, after the "checkpoint" operation to B-TMP, step 38 sees the P-TMP writing the indication of configuration data change to the MAT. The process (a disk process) that writes the information to the MAT will also write that same information to the mirrored volume retaining the MAT′. If the P-TMP then fails, and the B-TMP is required to take over, the B-TMP can retrieve information from the MAT in order to determine how best to proceed with the change (i.e., to continue, or to back out).

The P-TMP continues at step 40 by distributing the [illegible] to each of [illegible] of the [illegible], including the node on which the P-TMP is mounted. [illegible] MON (step 40), the P-TMP will wait for acknowledgements that the change [illegible]. [illegible] particular MON process [illegible] error in attempting to [illegible] its associated REG, the MON process will [illegible] of the particular node to shut the node down so that no data errors are propagated from the node to the rest of the nodes of the cluster.

Preferably, the system 10 utilizes a form of the "I'm alive" [illegible] in which each node periodically transmits to all other nodes a message indicative of that node's continuing good health. If an "I'm Alive" message is not received from a node by the remaining nodes, that node is considered to have [illegible] or have been removed and the P-TMP will be informed accordingly, so that it can formulate a [illegible] removed node. (The absence of an "I'm Alive" message [illegible] a node will be noted [illegible] backup processes [illegible] have or had associated primary processes on the failed node. The lack of an "I'm Alive" message will prompt those backup [illegible] into action [illegible] corresponding [illegible] that were on the now [illegible] node.) That registry modification is then checkpointed to the B-TMP (assuming that it is not the failed node), written to the MAT (and the mirror MAT, MAT′), and distributed to the nodes remaining in the system, in the manner described above.

In an embodiment of this invention, each of the nodes 12 will have installed and running a cluster manager (CM) process (not shown) responsible for keeping track of what nodes are in the system, what processor units are in what nodes, etc. [illegible] a CM on each node continually communicating with the CMs of the other nodes, and all communications sent require a reply. In view of this frequency of messaging between all nodes, it is believed unnecessary to add to the communication traffic carried by the network 15. [illegible] when the CM of any node n realizes that it has not heard from the CM of any other node (e.g., node n2, FIG. 1) within a predetermined period of time, it will declare that node "dead," and communicate that declaration to the remaining nodes (i.e., their CMs). This achieves the same result as the "I'm Alive" transmissions, but at less expense to overall system and node performance.

What is claimed is:

1. A method of maintaining a consistent, fault tolerant database of configuration data in each of a number of
processor units communicatively intercoupled to form a
multiple processor system, each of the number of processor
units maintaining a copy of the database configuration data
that is a substantial copy of the database of configuration
data maintained by each of the other of the number of
processor units, including the steps of:

receiving a request to modify the database at a one of the
number of processor units;
the one processor unit operating to
write the information corresponding to the request data
to a master audit trail file,
send a modify message corresponding to the request to
the number of processor units, including itself,
each of the number of processor units sending an
acknowledgment message to the one processor unit that
the modification as indicated in the modify message is
made, and

upon receiving acknowledgments from all the processor
units, making permanent in a master log information
indicative of the change.

2. The method of claim 1, wherein another of the number
of processor units serves as a backup processor, and including
the step of the one processor unit operating to send a data
message to the backup processor unit with information
corresponding to the request data.

3. The method of claim 2, wherein the step of operating
to send a data message to the backup processor unit occurs
before the step of operating to write the information.
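
As an informal illustration of the ordering recited in claims 2 and 3 (and not part of the claims), the sketch below checkpoints a request to the backup before writing it to the audit trail, then shows one way a backup might choose its action after a primary failure. The function names, the dictionary used as backup state, and the specific decision rule (continue if the request reached the audit trail, otherwise start over) are assumptions for the example; the description says only that the backup either continues or backs out depending on where the failure occurred.

```python
def handle_request(request, backup_state, audit_trail):
    """Primary side: checkpoint first, then write the audit trail."""
    backup_state["pending"] = request           # checkpoint to the backup
    audit_trail.append(("REQUESTED", request))  # then log the request


def takeover_action(backup_state, audit_trail):
    """Backup side: pick an action after the primary has failed (assumed rule)."""
    pending = backup_state.get("pending")
    if pending is None:
        return "idle"                 # nothing was in flight
    if ("COMPLETE", pending) in audit_trail:
        return "nothing to do"        # change already finished
    if ("REQUESTED", pending) in audit_trail:
        return "continue"             # logged but not finished: carry on
    return "restart"                  # checkpointed only: begin over again
```

Because the checkpoint precedes the audit-trail write, the backup always knows about any request that appears in the trail.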

4. In a multiple processor system of a type including a
communicatively intercoupled number of processor nodes,
each processor node comprising one or more processor units
interconnected to form a symmetric multiple processing
system, each of the processor nodes having associated
therewith configuration data that is a substantial copy of the
configuration data associated with each of the other of the
processor nodes, the multiple processor system from time to
time receiving requests that require changes to the
configuration data, a method of keeping consistent the configuration
data associated with each of the number of processor nodes,
including the steps of:

providing each of the number of processor nodes with a
monitor process responsible for maintaining
configuration data associated with such node;

designating a one of the number of processor nodes to
give residence to a primary process operating to,

(a) receive the requests requiring a change or
modification of the configuration data,

(b) write an indication of the change of each received
request to a master log,

(c) send a data message for each received request to
each of the monitor processes indicative of the
change to the configuration data;

each of the monitor processes receiving the data message
to effect changes in the configuration data contained in
the associated database of configuration data and
sending an acknowledgment message to the primary process
that the change to the database of configuration data is
complete; and

writing an indication to the master log that the change is
complete if all monitor processes respond with an
acknowledgment.

5. A method of maintaining a consistent, fault tolerant
database by a multiple processor system that includes a
number of processor units, including a first processor unit,
communicatively intercoupled to one another, the first
processor operating to maintain a master database, the other of
the number of processor units operating to maintain
substantially identical copies of the master database, the method
including the steps of:

receiving at the first processor unit a request to modify the
database;
the one processor unit operating to

(a) write information corresponding to the request data
to a master audit trail file,

(b) send a modify message corresponding to the request
to the other of the number of processor units;
each of the other of the number of processor units
receiving the modify message to

(c) perform a modification to the associated database as
indicated in the modify message, and

(d) reply to the modify message to inform the one of the
number of processor units that the modify message
was received.

6. A multiple processor system, comprising:

a number of processor nodes communicatively intercoupled
to one another, each of the processor nodes including,
one or more processor units in a symmetrical
multiprocessing configuration,
a configuration database, the configuration database of
each processor node being substantially identical to
the configuration database of each of the other
processor nodes,
a monitor process operable to maintain and make
changes to such configuration database;
the one of the processor units having a primary process
operating to receive a request for a change to the
configuration database associated with each of the
number of processor nodes and to
write an indication of the change to a master log,
send a data message to each of the monitor processes to
effect a change to the associated configuration
database;

receive from each of the monitor processes an
acknowledgment that the change to the corresponding
configuration database is complete, and complete the
write of the indication of the change to the master
log.
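
The description states that a corrupted registry can be corrected using the content of the master audit log. One way this might work, offered only as a hedged sketch, is to replay the log and keep the changes that were marked complete. The record layout `(status, change_id, key, value)` and the function name `rebuild_registry` are assumptions for illustration; the patent does not specify a log format.

```python
def rebuild_registry(audit_log):
    """Rebuild a registry from (status, change_id, key, value) records."""
    requested = {}
    registry = {}
    for status, change_id, key, value in audit_log:
        if status == "REQUESTED":
            # Remember the change, but do not apply it yet.
            requested[change_id] = (key, value)
        elif status == "COMPLETE" and change_id in requested:
            # Apply only changes that every node acknowledged.
            k, v = requested[change_id]
            registry[k] = v
    return registry


log = [
    ("REQUESTED", 1, "node.count", "4"),
    ("COMPLETE", 1, "node.count", "4"),
    ("REQUESTED", 2, "node.count", "5"),  # never completed
]
print(rebuild_registry(log))  # {'node.count': '4'}
```

Requests that were logged but never marked complete are ignored in this sketch, so the rebuilt registry reflects only changes effected throughout the cluster.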